Statistical Fault Detection for Parallel Applications with AutomaDeD

Authors

  • Greg Bronevetsky
  • Ignacio Laguna
  • Saurabh Bagchi
  • Bronis R. de Supinski
  • Dong H. Ahn
  • Martin Schulz
Abstract

Today’s largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but are difficult to detect and diagnose. This paper presents AutomaDeD, a statistical tool that models the timing behavior of each application task and tracks its behavior to identify any abnormalities. If any are observed, AutomaDeD can immediately detect them and report to the system administrator the task where the problem began. This identification of the fault’s initial manifestation can provide administrators with valuable insight into the fault’s root causes, making it significantly easier and cheaper for them to understand and repair it. Our experimental evaluation shows that AutomaDeD detects a wide range of faults immediately after they occur 80% of the time, with a low false-positive rate. Further, it identifies weaknesses of the current approach that motivate

Related resources

Application of Thau Observer for Fault Detection of Micro Parallel Plate Capacitor Subjected to Nonlinear Electrostatic Force

This paper investigates the fault detection of a micro parallel plate capacitor subjected to nonlinear electrostatic force. To this end, a Thau observer, which is well suited to fault detection in nonlinear systems, is presented, along with the governing nonlinear dynamic equation of the capacitor. Upper and lower thresholds for fault detection have been obtained. The robustness of th...

An approach to fault detection and correction in design of systems using of Turbo codes

We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one such family of codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...

Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs

Reversible logic is one of the new paradigms for power optimization and can be used in place of current circuits. Moreover, fault-tolerance capability in the form of error detection or error correction is a vital aspect of current processing systems. In this paper, as multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...

A new technique for bearing fault detection in the time-frequency domain

This paper presents a new Fast Kurtogram Method in the time-frequency domain that uses novel types of statistical features instead of the kurtosis. For this study, a four-class bearing fault detection problem is investigated using various statistical features. This research is conducted in four stages. At first, the stability of each feature for each fault mode is investigated. Then, res...

Fault Detection and Classification in Double-Circuit Transmission Line in Presence of TCSC Using Hybrid Intelligent Method

In this paper, an effective method for fault detection and classification in a double-circuit transmission line compensated with TCSC is proposed. The mutual coupling of parallel transmission lines and the presence of TCSC affect the frequency content of the input signal of a distance relay, and hence fault detection and fault classification face some challenges. One of the most effective methods fo...

Publication date: 2011